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•^^ . Abstract. Popular sequence alignment tools such as BWA convert a 

O' reference genome to an indexing data structure based on the Burrows- 

Wheeler Transform (BWT), from which matches to individual query 
Q ' sequences can be rapidly determined. However the utility of also indexing 

the query sequences themselves remains relatively unexplored. 
Here we show that an all-against-all comparison of two sequence coUec- 
^^' tions can be computed from the BWT of each collection with the BWTs 

held entirely in external memory, i.e. on disk and not in RAM. As an 
application of this technique, we show that BWTs of transcriptomic and 
genomic reads can be compared to obtain reference-free predictions of 
splice junctions that have high overlap with results from more standard 
reference-based methods. 
1/^ ' Code to construct and compare the BWT of large genomic data sets 

1/^ ! is available at http : //beetl . github . com/BEETL/, as part of the BEETL 

l ' ' library. 
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1 Introduction 

In computer science, a suffix tree is the classical example of an indexing data 

^ . structure which, when built from some text T, allows the presence or absence 

3 I of a query string S' in T to be rapidly determined. A sufhx tree is several times 

■ ■ ■ larger than the text it indexes, but research since 2000 (well summarized in |13| ) 

has led to compressed full-text indexes that provide the same functionality as 
the suffix tree while taking up less space than the text itself. 

Of these, the FM-index has become central to bioinformatics as the com- 
putational heart of popular sequence alignment tools such as BWA [5] , Bowtie 
[5] and S0AP2 [T^. All these programs work in a similar way, with individual 
query sequences being searched for one-by-one in the index of a reference genome. 
The FM-index of, say, the latest human reference sequence can be viewed as a 
constant and precomputed, so the cost of building it is not important for this 
particular use case. 



Constructing the FM-index of T is dominated by the computation of the 
Burrows- Wheeler transform, a permutation of the symbols of T that also has 
widespread applications in data compression [2]- For large T, this calculation 
requires either a large amount of RAM or a cumbersome divide-and-conquer 
strategy. However, in p?|4 . two of the present authors demonstrated that if T 
can be considered to be a large number of independent short patterns then its 
BWT can be built partially or entirely in external memory (that is, by sequential 
access to files held on disk). This leads us to the aim of the present work, which 
is to introduce some of the additional possibilities that arise if the set of query 
sequences is also indexed. 

The search for a single pattern in an FM-index may potentially need access to 
any part of the BWT, requiring the entire BWT to be held in RAM to guarantee 
that this can be efficiently achieved. However, building on our previous work, 
we show that the computations needed for an all-against-all comparison of the 
sequences in two collections can be arranged so that the BWTs of the collections 
are both accessed in a series of sequential passes, permitting the comparison to 
be done efficiently with the BWTs held on disk. In k passes, this procedure 
traverses all fc-mers that are present in one or both of the two indexes. This 
traversal can be viewed as a template upon which different sequence comparison 
tasks can be defined by specifying particular sets of behaviours according to 
whether each of the fc-mers encountered is unique to one or the other dataset, 
or shared by both. To illustrate, we show how this template may be adapted 
to the task of comparing transcriptomic and genomic reads from an individual 
eukaryotic organism to deduce exon-exon splice junctions. 

In a eukaryotic genome, large tracts of intragenic and intronic DNA will be 
represented in the genome but not the transcriptome, but the sequences present 
in the transcriptome alone are far fewer and of more interest: notwithstanding 
experimental artefacts and the relatively rare phenomenon of RNA editing, these 
must span splice junctions between exons. By obtaining the genomic and tran- 
scriptomic samples from the same individual, we eliminate the possibility that 
differences between the datasets are due to genetic variation between individuals. 

We apply our methods to data from the Tasmanian Devil (S. Harrissii) . The 
hypothesis- and reference-free nature of our procedure is advantageous for de 
novo projects where no reference sequence is available or cases where, as here, 
the reference genome is of draft quality. 



2 Methods 

2.1 Definitions 

Consider a string s comprising k symbols from an alphabet S — {ci, C2, . . . , Co-} 
whose members satisfy Ci < C2 < ■ ■ ■ < Ca- We mark the end of s by appending 
a special end marker symbol $ that satisfies $ < ci . We can build k + 1 distinct 
suffixes from s by starting at different symbols of the string and continuing 
rightwards until we reach $. If we imagine placing these suffixes in alphabetical 



order, then the Burrows- Wheeler transform |5|2j of s can be defined such that the 
i-th element of the BWT is the symbol in s that precedes the first symbol of the i- 
th member of this ordered list of suffixes. Each symbol in the BWT therefore has 
an associated suffix in the string. A simple way (but not the only way - see |11| ) 
to generalize the notion of the BWT to a collection of m strings S — {si, . . . , Sm} 
is to imagine that all members Si of the collection are terminated by distinct 
end markers $i such that $i < • • • < $„j < ci. 

The characters of BWT(S') whose associated suffixes start with some string 
Q form a single contiguous substring of BWT(S'). We call this the Q-interval 
of BWT(S') and express it as a pair of coordinates [bQ^eq), where 6q is the 
position of the first character of this substring and eg is the position of the first 
character after it. This definition is closely related to the Icp-interval introduced 
in pQ: the Q-interval is an Icp-interval of length |Q|. If Q is not a substring of 
any member of S then a consistent definition of its Q-interval is [bg, bq), where 
6q is the position Q would take if it and the suffixes of S were to be arranged 
in alphabetical order. 

Each occurrence of some character c in the Q-interval corresponds to an 
occurrence of the string cQ in S. We call cQ a backward extension of Q. If all 
characters in the Q-interval are the same then Q has a unique backward extension, 
which is equivalent to saying that all occurrences of Q in S* are preceded by 
c. A Q-interval of size 1 is a special case of unique backward extension that 
corresponds to there being a unique occurrence of Q in S: we call this a singleton 
backward extension. Similarly, we can say that appending a character c to Q to 
give Qc creates a forward extension of Q. Since all suffixes that start with Qc 
must also start with Q, the Qc-interval is clearly a subinterval of the Q-interval. 



2.2 All-against-all backvirard search 

We can use Q-intervals to describe the backward search algorithm for querying 
BWT(S') to compute occ(P), the number of occurrences of some query string 
P ^ pi ■ ■ -Pn in S. This proceeds in at most n stages. At stage j, let Q be the 
j -suffix {i.e. the last j symbols) of P, let c be the character preceding Q in P 
and assume we know the position of the (non-empty) Q-interval in BWT(S'). 
The number of occurrences of cQ in S is given by the number of occurrences 
of c in the Q-interval (see [I]). If this is zero, then we know that cQ does not 
occur in S and so occ(P) must be zero. Otherwise, we observe that the number 
of occurrences of cQ in S is by definition the size of the cQ-interval. The start 
of the cQ-interval is given by the count of c characters that precede the start of 
the Q-interval. These are the two pieces of information we need to specify the 
cQ-interval that we need for the next iteration. At the last stage, the count of 
pi characters in the p2 ■ ■ -pn-interval gives occ(P). 

Running this procedure to completion for a single query P entails counting 
symbols in intervals that can potentially lie anywhere in BWT(5'), the whole 
of which must therefore reside in RAM if we are to guarantee this can be done 
efficiently. However, we show that occ(P) may be computed for all strings P of 



length k or less in 5* by making k sequential passes through BWT(S'), allowing 
the processing to be done efRciently with BWT(5) held on disk. 

At the start of iteration j, we open an array _F of cr write-only files and set 
each entry of an array U oi a counters to zero. During the iteration, we apply 
the processIntervalO function described in Figure [1] to the Q-intervals of all 
_7-suSixes of S which, by induction, we assume are available in lexicographic 
order. With this ordering, the intervals \hQ,eQ), [^Q't^Q') of t'^o consecutive 
j-sufRxes Q, Q' satisfy eg < &q', which means that we can update the counters 
n and TT needed by processIntervalO by reading the symbols of BWT(iS') 
consecutively. 

These arrays simulate the rank() function used in the FM-index - II[i] holds 
rank(ci, bg), the number of occurrences oi Ci prior to the start of the bg, whereas 
Tr[i] counts the occurrences of c^ in the Q-interval [bq, eg) itself. The pair {n[i], 7r[i] 
that is appended to the file F[i] specifies the start position and size of the CiQ 
interval. The last act of processIntervalO is to update 77 to count the occur- 
rences of each symbol up to position eg and return the updated array ready to 
be passed in at the next call to the function. 

At the end of the iteration, each file F[i] contains the Q- intervals of all (j-f 1)- 
suffixes that start with symbols Ci, in lexicographic order. If we consider the files 
in the order F[l], F[2], . . . , F[<t] and read the contents of each sequentially, we 
have the lexicographic ordering of the {j + l)-suffixes that we need for the next 
iteration. 



function processInterval([6q, eg), 73,77, _?) 

Update 77 if necessary so that each 77[i] counts occurrences of d in B[0, bq) 
Create tt such that each 7r[i] counts occurrences of Ci in 73 [6q, cq) 
for i — 1 ^ a do 
if n\i] > then 

Write [77[i],7r[i]) to file 7^ [i] 
end if 
end for 
77 ^77 + 7r 
return 77 
end function 

Fig. 1. Given a Q-interval [bQ,eQ) of Q in a BWT string 73, the function 
processIntervalO computes the CiQ-intervals for all backward extensions CiQ of Q 
that are present in [bQ,eQ) and appends them to the appropriate file 7^[i] ready for 
processing during the next iteration. 



2.3 All-against-all comparison of tvifo BWTs 

The concept of all-against-all backward search can be extended to compute the 
union of all suffixes of length k or less present in two collections 5*^1 and Sb by 



while (1) do 

while (gotQ == true) do 

gotQ = getNextInterval(6Q, eg, rriQ, BWT(5'a)) 
if (gotQ —— false) or (mg —— true) then 

break 
end if 

doAOnlyBehaviour() 

processInterval(feQ, CQ, BWT(S'a), Ua, Fa) 
for i = 1 — 7- a do 

if {piA[i] > 0) then 

Write false to file MA[i] 
end if 
end for 
end while 
while (gotR == true) do 

gotR = getNextInterval(bij, cr, niR, BWT(S'b)) 
if (gotR —— false) or (m/} == true) then 

break 
end if 

doBOnlyBehaviour() 

processInterval(feii, cr, BWT(S'b), IIb, Fb) 
for i = 1 -T- a do 

if (p«s[i] > 0) then 

Write false to file MB[i\ 
end if 
end for 
end while 
if (gotQ —— false) then 

break 
end if 

doSharedBehaviour ( ) 

processInterval(feQ, eg, BWT(S'a), Ha, Fa) 
processInterval(fe_R, cr, BWT(S's), 77_g, Fb) 
for i = 1 — > (7 do 

if {pi A [i] > 0) and (pis [i] > 0) then 

Write true to files Ma [i] , Mb [i] 
else if {piA[i] > 0) then 

Write false to file MaIi] 
else if {piB[i] > 0) then 

Write false to file MB[i] 
end if 
end for 
end while 

Fig. 2. Pseudocode for stage j of the all- against- all comparison of the BWTs of the 
collections Sa and Sb- Consecutive calls to function getNextInterval() are assumed 
to populate bq, eq and mg with details of the Q-intervals of the j-suffixes of the 
relevant BWT in lexicographic order, returning false once the list of intervals has been 
exhausted. In practice, these intervals are read sequentially from the sets of files Fa, 
Fb , Ma and Mb that were generated during the previous execution of this procedure. 



making k passes through thch BWTs. Figure[2]describes the logic of a single pass. 
Conceptually, each pass is a simple merge of the two lists of Q-intervals of a given 
length, with the ordering of Q-intervals being determined by the lexicographic 
ordering of their associated suffixes Q. The notable implementation detail is that 
associating an additional bit mg with each Q-interval avoids the need to store 
and compare strings when deciding whether Q-intervals from the two collections 
are associated with the same suffix. As in the previous section, this insight is 
best understood inductively: if a Q-interval is shared (which we know because 
mq is set to true), then any common backward extensions cQ must also be 
common to both collections (and nicQ must be set to true to reflect that). 

Each suffix Q we encounter during the execution of the algorithm in Figure [2] 
is either present in Sa only, present in Sb only, or common to both Sa and Sb 
and the algorithm calls different functions in the event of these three possibil- 
ities. We show how to specify the behaviour of these three functions so as to 
compute differences between genomic and transcriptomic sequence data from an 
individual eukaryotic organism. 

Figure [3] shows a very simple example of how splicing in the transcriptome 
might give rise to a read set T containing sequence that is not present in the 
genomic reads G. Figure 2] shows the BWTs of the two datasets. In the func- 
tion doSharedBehaviourO, we look for Q-intervals shared by BWT(r) and 
BWT(G') for which the Q-interval in BWT(G') has a unique backward extension 
but the corresponding Q-interval in BWT(T) exhibits significant evidence of 
one or more different backward extensions cQ. In our implementation, spurious 
junction predictions due to sequencing error are guarded against by ignoring 
any such backward extensions for which the number of occurrences (given by 
the number of c symbols present in the Q-interval) fails to exceed a threshold. 
Any T-only cQ-intervals that do pass this test are backward-extended in subse- 
quent intervals by doAOnlyBehaviour() until a string CQ is obtained for which 
the size of the CQ-interval fails to exceed a threshold t, which is equivalent to 
demanding that CQ must occur at least t times in T. The aim of this exten- 
sion is to accumulate as much sequence context as possible to the left of the 
putative exon/exon junction. In a similar way, we could improve specificity by 
allowing doBOnlyBehaviourO to extend G-only intervals and thus accumulate 
sequence context that reaches into the separating intron, although our current 
implementation does not do this. 

If the sequence context to the right of a predicted junction is a prefix of the 
sequence that lies to the right of another prediction junction, then the former 
prediction is subsumed into the latter. For example, in Figure [3] the same splice 
junction gives rise to reads TCACA and CACAT with rightward contexts ACA and 
ACAT: the former is a prefix of the latter. This aggregation of predictions is 
conveniently done by making a single pass through the final list of predictions 
once they have been sorting in lexicographic order of their rightmost context. 
As well as removing repeated predictions, this acts as a further guard against 
false positives - we discard any predictions whose contexts cannot be rightward- 
extended in this way. 



Finally, we note that the double-stranded nature of DNA is handled by ag- 
gregating the individual chromosomal sequences plus their reverse complements 
into a sequence collection and building the BWT of that. 
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Fig. 3. Simple example of genome/transcriptome comparison. In the genome (below 
the line), the exons ATTC and ACAT (in bold) are separated by an intron (italics). In 
the transcriptome (above), the splicing together of these exons gives rise to reads 
T = {CACAT, TCACA} containing the exon/exon boundary whereas, in the genome, the 
reads G = {AGACA, GACAT} extend from the ACAT exon into the intron. 
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Fig. 4. Comparison of the BWTs of T and G from Figure [S] During the second 
execution of the procedure in Figure [21 we find the AC-interval in BWT(G') has a 
unique backward extension G, but the corresponding interval is BWT(T) has a different 
backward extension C. This is corroborated (and the sequence context to the right of 
the spUce junction is extended) at steps 3 and 4 when the ACA- and ACAT-intervals 
of the two BWTs are compared. The CA-interval also suggests a divergent backward 
extension, but the lack of a forward extension of CA that is common to both T and 
G means this observation is not corroborated by subsequent executions of the code in 
Figure [2] and is therefore discarded. 



3 Results 

3.1 Reference-free detection of splice junctions 

We tested our approach using data from a recent study [H] during which 1.45 
bilhon genomic reads and 132 miUion RNA-Seq reads, all lOObp in length, were 
obtained from an individual Tasnianian devil. The RNA-Seq library was pre- 
pared from a mixture of mRNA from 11 different tissues to obtain broad cov- 
erage of gene content. The genome of the Tasnianian devil was estimated to be 
between 2.89 and 3.17 gigabase pairs (Gb) in size and is thus comparable in 
size to the human genome. De novo assembly of the genomic reads yielded 3.17 
Gb of sequence with an N50 of 1.85 megabase pairs (Mb). The Ensembl gene 
annotation pipeline was then applied to the assembled contigs: evidence from 
alignment of mammalian EST, protein and RNA-Seq sequences was combined 
and then various gene prediction algorithms were used to refine these alignments 
and to build gene models (more detail on the annotation procedure is given in 
the supplement of [H]). We obtained the most recent version (0.67) of the Tas- 
manian devil gene annotation from the Ensembl FTP site. It contains 20456 
genes which give rise to 187 840 exon junction sites. 

We built BWTs of both the genomic and RNA-Seq read sets using the al- 
gorithms given in [3| and compared them as described in the previous sections. 
The sequences to the right and left of each prediction were aligned to the devil 
assembly using BWA [9] in single-read mode, setting the option to allow up to 10 
candidates for each read. Predictions for which the left and right halves aligned 
to the same contig with appropriate orientation were classified as putative junc- 
tion sites: we obtained 171371 of these. 

We also predicted gene models and junction sites from the same RNA-Seq 
reads using version 2.0.0 of Tophat [T7], which is a popular tool for this task. 
Tophat first aligns reads to a reference genome with the Bowtie2 aligner [8] then 
builds splicing models based on these alignments. The results of our comparison 
are summarized in Table [5] Tophat predicts 120010 junction sites, of which 
66 587 are not contained in the gene annotation. 



Tool 



Junctions predicted 



True positives 



False Negatives 



Sensitivity (%) 



FDR (%) 



BWT 
Tophat 



171 371 
120 010 



93615 
66 587 



94 225 
121 253 



49.84 
35.45 



45.37 
44.51 



Fig. 5. Comparison of junction site predictions. Our approach predicts 171 371 sites 
and Tophat predicts 120 010. Treating the Ensembl annotation as a gold standard, we 
evaluate sensitivity and false discovery rate of each method. The BWT-based approach 
is competitive with the established software Tophat. 



Using the BEDtools software suite [T3] , we identified junction sites that over- 
lap with sites contained in the Ensembl gene annotation. We used default pa- 
rameters, apart from requiring a reciprocal overlap of 90% of the feature length. 



Of the 171371 sites computed by our approach, 94 225 match known Ensembl 
predictions. Manual inspection of the remaining sites revealed that many were 
contained in putative gene annotation derived from EST alignments or from 
ab initio gene recognition algorithms. These putative annotations were not in- 
corporated into the final annotation because of various threshold or partially 
contradicting evidence. They represent nevertheless likely candidates for coding 
regions. The EST alignments cover 48.06 Mb in 22 582 alignments and the ab 
initio predictions cover 44.92 Mb in 44659 regions. Of the 77 756 junction sites 
detected by our method that did not have a counterpart in the Ensembl pre- 
diction, 24 322 did not have a match in the EST alignment data set and 11 168 
did not match coding regions predicted by ab initio algorithms. Taking these 
sets together, we found that only 8 755 out of 171 371 (5.11%) did not have any 
evidence of being in transcribed regions. For Tophat, 53 423 junction sites did 
not have a match in the Ensembl gene annotation. Of these predictions, 14668 
did not have a match in regions covered by EST alignments and 6 227 did not 
have a match in ab initio gene predictions. In sum, 4 732 Tophat predictions 
(3.94%) did not match any potentially coding regions. 

4 Discussion 

In this work, we show that BWTs of transcriptomic and genomic read sets can be 
compared to obtain reference-free predictions of splice junctions that have high 
overlap with results from more standard reference-based methods. Our method 
predicts splice junctions by directly comparing sets of genomic and transcrip- 
tomic reads and can therefore provide orthogonal confirmation of gene predic- 
tions obtained by comparative genomics approaches. A reference sequence is not 
required for the prediction process itself (here we map the predicted junctions 
to the assembly only for comparison purposes) , making the method particularly 
well suited to the analysis of organisms for which no reference genome exists. 

When comparing the performance of our method with Tophat we find that, 
at least on this data, our approach has superior sensitivity and comparable false 
discovery rate. In order to give a strong proof of principle we deliberately avoided 
building any sort of prior information about gene structure into our analysis. In 
contrast, Tophat makes assumptions about the presence of canonical dinucleotide 
motifs at donor/acceptor sites and the relative abundance of isoforms. This is an 
entirely reasonable thing to do, but it is conceivable that Tophat's use of prior 
information might be a disadvantage for this particular dataset as it is not clear 
to what extent these signals are conserved across species and in particular in the 
Tasmanian devil. 

An obvious piece of prior information needed by both Tophat and the En- 
sembl annotation pipeline is of course a reference sequence. Although considered 
to be of 'draft' quality, the Tasmanian devil assembly we used [12] nevertheless 
required not only both considerable computational and manual effort to gener- 
ate but also made use of additional sequencing data in the form of long-insert 
mate pair libraries. 



While the Ensembl annotation pipehne is a robust and well-estabhshed method- 
ology, we note that our implicit treatment of the Ensembl annotation as abso- 
lute truth is an assumption that might be questioned, since the pipeline is being 
applied here to a draft assembly from a relatively poorly-understood genome. 
Nevertheless, we believe that our results do demonstrate that direct comparison 
of BWTs gives results that are biologically credible and that are competitive 
with existing tools. 

The need to sequence the genome as well as the transcriptome means our 
method is unlikely to supplant methods such as Tophat which can operate on 
transcriptome data alone. Comparison of transcriptome to exome data might be 
more practical and have some utility, although it is arguable whether such an 
approach remains hypothesis-free. However in situations where, as here, both 
genome and transcriptome data are available our method may provide valuable 
additional information. Even for a much better characterized genome such as 
human, our algorithm should provide insight into transcription from regions that 
are not well-represented in the reference sequence and might also be a useful tool 
for investigating RNA editing. A further improvement of the method would be 
to used read-pairing information to link junction sites that are present in the 
same read pair and hence in the same transcript. 

Computing the BWTs of the genomic and transcriptome read sets took 
around 6 days and 12 hours of wallclock time respectively, although the method 
employed ran entirely in external memory and so did not require high-end com- 
puting hardware. Moreover, our previous work 3, suggests that these compu- 
tation times could be approximately halved by storing the work files on a SSD 
(flash memory) drive and could be further improved by using a different algo- 
rithm that reduces I/O at the expense of moderate RAM usage. Indeed, one 
could make a case that the cost of BWT computation should not be included in 
the overall compute time, since it is useful in its own right for lossless compres- 
sion of the data [6] and for facilitating other analyses such as de novo assembly 

[HITS]. 

The comparison of BWTs ran in just under 3 days of wallclock time. Again, 
all processing was done in external memory and could be sped up by the use 
of an SSD drive or, alternatively, the sequential nature of the algorithm's I/O 
access would facilitate cache-efficient processing if the BWT files were instead 
held in RAM on a high-end machine. To put these numbers into context, the 
analysis using TopHat took 18.6 hours but this obviously does not include the 
time to assemble and curate the reference genome. 

In addition, our implementation is a proof-of-principle with considerable 
scope for optimization. Future work will focus on such improvements and on 
exploring further applications of the algorithm described in Figure [5] many im- 
portant tasks in sequence analysis can be reinterpreted as a comparison between 
BWTs, not least the comparison of tumour and normal read sets from cancer 
samples and the comparison of reads to a reference sequence. 
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